Background

The video game sector is currently one of the biggest entertainment industries and, as of 2019, is thought to be worth over $145 billion. As the gaming industry has grown, so too has the sheer number of games available for consumers to choose from. With so much choice, the opinions and reviews of games critics are often sought by people who engage in gaming as a hobby, especially as these reviews are usually available prior to a game's release. Many of the more popular media review websites allow general users to publish their own reviews as well. It seems logical to assume that good games will be reviewed more highly, and will therefore sell better than bad games with poorer review scores. This project sought to examine whether this assumption holds.

Research Question

  • Does the review score for a game have any relationship to how well it sold?

Data Origins

The data used here was published by Rush Kirubi and can be accessed and downloaded on Kaggle. To create this data set, data was scraped from two websites in 2016:

  • Metacritic, a website which contains rating scores from critics and users.
  • VGChartz, a website containing the regional and global sale numbers for games.

Unfortunately, the original web scraping code which created this data set no longer works following structural changes to the Metacritic and VGChartz websites, and has been removed from Kaggle as a result. I was therefore unable to create a more up-to-date version of this data set. However, despite its age, the completed data set from 2016 is still sufficient for answering my research question.

Data Preparation

This data set contains 16 variables, which are explained in the table below. These explanations can also be found in my code book in the GitHub repository.

Variable Name      Meaning
Name               The name of the game
Platform           Which console or platform the game was released on
Year_of_Release    The year in which the game was released
Genre              The genre of the game
Publisher          The name of the company which published the game
NA_Sales           The number of copies of a game sold in North America (in millions)
EU_Sales           The number of copies of a game sold in Europe (in millions)
JP_Sales           The number of copies of a game sold in Japan (in millions)
Other_Sales        The number of copies of a game sold in the rest of the world (in millions)
Global_Sales       The number of copies of a game sold globally (in millions)
Critic_Score       The average review score given by video game critics on a scale of 0-100
Critic_Count       The number of critics who gave a review score for a game
User_Score         The average review score given by general users of the Metacritic website on a scale of 0-10
User_Count         The number of general users who gave review scores for a game
Developer          The name of the company which developed the game
Rating             The ESRB rating for a game

Firstly, I had to check that the data loaded and displayed correctly in R. Inspecting the first few rows of the data showed whether it had been read in properly.

#Load the necessary packages for the code
library(here)
library(tidyverse)
library(ggExtra)
library(dplyr)
library(plotly)

#Load the data into the workspace
Data <- read.csv(here("Data", "VideoGameSales2016.csv"))

#Check the data loaded correctly
head(Data)
##                       Name Platform Year_of_Release        Genre Publisher
## 1               Wii Sports      Wii            2006       Sports  Nintendo
## 2        Super Mario Bros.      NES            1985     Platform  Nintendo
## 3           Mario Kart Wii      Wii            2008       Racing  Nintendo
## 4        Wii Sports Resort      Wii            2009       Sports  Nintendo
## 5 Pokemon Red/Pokemon Blue       GB            1996 Role-Playing  Nintendo
## 6                   Tetris       GB            1989       Puzzle  Nintendo
##   NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales Critic_Score Critic_Count
## 1    41.36    28.96     3.77        8.45        82.53           76           51
## 2    29.08     3.58     6.81        0.77        40.24           NA           NA
## 3    15.68    12.76     3.79        3.29        35.52           82           73
## 4    15.61    10.93     3.28        2.95        32.77           80           73
## 5    11.27     8.89    10.22        1.00        31.37           NA           NA
## 6    23.20     2.26     4.22        0.58        30.26           NA           NA
##   User_Score User_Count Developer Rating
## 1          8        322  Nintendo      E
## 2                    NA                 
## 3        8.3        709  Nintendo      E
## 4          8        192  Nintendo      E
## 5                    NA                 
## 6                    NA

I then decided to create a new dataframe containing only a subset of these variables, so that the data would be more straightforward to work with and manipulate whilst leaving the original data intact. The following code creates a new dataframe containing the Name, Global_Sales, Critic_Score and User_Score columns using the base R cbind function, and prints the first few rows.

#Create new dataframe containing the Name, Critic_Score, User_Score and Global_Sales columns
Data2 <- cbind(Data[c("Name", "Global_Sales","Critic_Score", "User_Score")])

#Check it has loaded using the head() function
head(Data2)
##                       Name Global_Sales Critic_Score User_Score
## 1               Wii Sports        82.53           76          8
## 2        Super Mario Bros.        40.24           NA           
## 3           Mario Kart Wii        35.52           82        8.3
## 4        Wii Sports Resort        32.77           80          8
## 5 Pokemon Red/Pokemon Blue        31.37           NA           
## 6                   Tetris        30.26           NA

Once the data was loaded in correctly, it was important to run a number of checks to see how the data was formatted within R. Firstly, I checked the data type of the Critic_Score and User_Score columns. Although straightforward, this was an important step because both variables need to be the same type in order to be combined into a single score. Critic_Score was automatically stored as integer, and User_Score was stored as character data. The code to convert these variables into the same type of data is shown below.

#Check what class of data Critic and User scores are stored as 
class(Data2$Critic_Score)
class(Data2$User_Score)
#Convert both user and critic scores to numeric
Data2$Critic_Score <- as.numeric(Data2$Critic_Score)
Data2$User_Score <- as.numeric(Data2$User_Score) #This flags an "NAs introduced by coercion" warning, as non-numeric entries become NA; it can be ignored here
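
The warning arises because as.numeric() converts any string it cannot parse into NA. The following is a minimal sketch on made-up values rather than the real column; the "tbd" entry is a hypothetical stand-in for whatever non-numeric text the column may contain.

```r
# Toy vector standing in for User_Score; "tbd" is a hypothetical non-numeric entry
raw_scores <- c("8", "", "8.3", "tbd")

# suppressWarnings() hides the "NAs introduced by coercion" warning
num_scores <- suppressWarnings(as.numeric(raw_scores))

# Count how many entries could not be converted and are now NA
n_lost <- sum(is.na(num_scores))

print(num_scores)
print(n_lost)
```

Counting the NA values before and after conversion is a quick way to confirm that only unusable entries were lost.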

I also checked the number of complete cases for the Critic_Score and User_Score variables, as I was aware that the Metacritic website does not contain information for certain older platforms such as the SNES. As such, I expected a number of missing cases for these variables in the data set.

#Check the number of complete cases which have values for critic and user score together
length(na.omit(Data2$Critic_Score & Data2$User_Score))
## [1] 7018
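
One caveat with this check is that `&` coerces numbers to logical values, so a score of 0 paired with a missing score evaluates to FALSE rather than NA and is still counted after na.omit(). The sketch below compares that approach with complete.cases() on a made-up dataframe; the values are illustrative and are not taken from the data set.

```r
# Made-up scores: row 4 has a missing critic score and a user score of 0
toy <- data.frame(
  Critic_Score = c(76, NA, 82, NA),
  User_Score   = c(8, NA, 8.3, 0)
)

# Logical AND approach: NA & 0 is FALSE (not NA), so row 4 is counted
n_and <- length(na.omit(toy$Critic_Score & toy$User_Score))

# complete.cases() counts only the rows where both scores are present
n_complete <- sum(complete.cases(toy[c("Critic_Score", "User_Score")]))

print(c(n_and = n_and, n_complete = n_complete))
```

If the two counts differ on the real data, the complete.cases() figure is the one that matches the number of games with both scores.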

Preparing the dataset

In order to combine Critic_Score and User_Score, it was necessary to convert them to the same scale. In the original data set, critic scores are on a scale of 0-100, whilst user scores are on a scale of 0-10. I decided to convert User_Score to a 0-100 scale rather than convert Critic_Score to a 0-10 scale, as ratings on a 0-100 scale felt more intuitive and easier to visualise.

To do this, the mutate function was used to multiply every valid User_Score by 10. For example, a score of 8 would be converted into a score of 80. Following this, I created another variable containing the overall review score for a game by averaging Critic_Score and User_Score. Games missing either score therefore have a missing overall score.

#Convert the User_Score values into the same scale as the Critic_Score
Data2 <- Data2 %>%
  mutate(User_Score = User_Score * 10)

#Create a new variable containing the average review score of critics and users for a given game
Data2$Overall_Score <- (Data2$User_Score + Data2$Critic_Score) / 2

#Display first few rows of edited dataset
head(Data2)
##                       Name Global_Sales Critic_Score User_Score Overall_Score
## 1               Wii Sports        82.53           76         80          78.0
## 2        Super Mario Bros.        40.24           NA         NA            NA
## 3           Mario Kart Wii        35.52           82         83          82.5
## 4        Wii Sports Resort        32.77           80         80          80.0
## 5 Pokemon Red/Pokemon Blue        31.37           NA         NA            NA
## 6                   Tetris        30.26           NA         NA            NA
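
The NA values in Overall_Score follow from how arithmetic treats missing data: if either score is NA, the sum and therefore the average is NA. The sketch below shows this on made-up values in base R, alongside a hypothetical alternative column (Available_Score, not used in this project) that averages whichever scores exist.

```r
# Made-up scores: row 2 lacks a critic score, row 3 lacks a user score
toy <- data.frame(
  Critic_Score = c(76, NA, 82),
  User_Score   = c(8, 7.5, NA)
)

# Rescale user scores from 0-10 to 0-100
toy$User_Score <- toy$User_Score * 10

# Average as in the project: NA whenever either score is missing
toy$Overall_Score <- (toy$User_Score + toy$Critic_Score) / 2

# Hypothetical alternative: average only the scores that are available
toy$Available_Score <- rowMeans(toy[c("Critic_Score", "User_Score")], na.rm = TRUE)

print(toy)
```

Requiring both scores keeps the overall score comparable across games, at the cost of dropping games reviewed by only one group.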

Visualisation 1

For my first graph I decided to create a smoothed line graph. I decided to smooth the data so it would be easier to interpret overall trends and get a better overall idea of the data set.

#Create a smoothed line graph of review scores and global sale figures
#Aesthetics are mapped to column names so they are evaluated within the subsetted data
#linewidth replaces size for lines from ggplot2 3.4.0 onwards
g <- ggplot(data = na.omit(subset(Data2, Overall_Score > 20 & Overall_Score < 95)), aes(x = Overall_Score, y = Global_Sales))
g2 <- g + geom_smooth(method = "loess", colour = 'darkred', linewidth = .7, se = FALSE) +
  #Limit xaxis
  xlim(0, 100)+
  #Create title and axis labels
  labs(x = "Review Score", y = "Global sales (units/millions)", title = "Game Review Scores and Global Sales")+
  #Specify title text size and location
  theme(plot.title = element_text(size = 12, face = 'bold', hjust = 0.5))+
  #Change axis text
  theme(axis.text = element_text(color = 'black', size = 6))+
  #Change background to white
  theme(panel.background = element_rect(fill = 'white'))

#View graph
g2

Visualisation 1 Summary

This graph gives a very clear indication of the trends in this data set. It shows that as review score increases, so do global sales. The main issue I had with this graph was that, when examining the first few rows of the data earlier, I had noted that some games (such as Wii Sports) had global sales as high as 82.53 million. This made me suspicious that some outliers may have had a heavy influence on the trends being observed here. I therefore decided to make a different plot to see whether I could visualise trends which were more representative of the gaming industry in general.
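
To illustrate why a single extreme value is a concern, the sketch below compares the mean and median of a small made-up sales vector. Only the 82.53 figure comes from the data shown earlier; the other values are invented for illustration.

```r
# Made-up sales figures (in millions) with one extreme value
sales <- c(0.1, 0.2, 0.3, 0.4, 0.5, 82.53)

# The mean is dragged upwards by the single large value
mean_sales <- mean(sales)

# The median is barely affected by it
median_sales <- median(sales)

print(c(mean = mean_sales, median = median_sales))
```

A smoother fitted to averages can be pulled in the same way when a few games sell far more than the rest.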

Visualisation 2

After suspecting that visualisation 1 was being heavily influenced by outliers, I decided that a joint plot would be a more appropriate way to answer my question. A scatterplot would allow me to observe the review score for each individual game and compare it to how many units (in millions) were sold globally. The marginal histograms would give me an indication of how the data is distributed overall. After looking at the first few rows of data using the head command, I knew there were some outliers in terms of global sale figures. To create this joint plot, I used the ggExtra package, which can create graphs similar to those produced by the Seaborn library in Python.

#Create a joint scatter and histogram plot using the ggExtra package
#This compared the global unit sales of a game with the overall review score
#Specify dataframe and axes
plot_center <- ggplot(Data2, aes(x = Overall_Score, y = Global_Sales)) +
  #Create scatterplot and specify point colour and size
  geom_point(color = 'darkred', alpha = .3, size = .8, stroke = 0) +
  #Create title and axis labels
  labs(x = "Review Score", y = "Global sales (units/millions)", title = "Game Review Scores and Global Sales") +
  #Specify title text size and location
  theme(plot.title = element_text(size = 12, face = 'bold', hjust = 0.5)) +
  #Change axis text
  theme(axis.text = element_text(color = 'black', size = 6)) +
  #Change background to white
  theme(panel.background = element_rect(fill = 'white'))
#Add marginal histograms to the scatterplot
ggMarginal(plot_center, type = "histogram",
           color = 'black', fill = 'darkred')

The first iteration of this plot is nearly impossible to interpret due to dense clusters of data below 2 million global sales, and some outliers with incredibly high global sales (Wii Sports was the biggest culprit here). I therefore decided to run some code to check the range, mean and standard deviation of my global sale data. Based on the results, it made sense to recreate the initial visualisation with more sensible boundaries, so that the pattern and distribution of the data could be better identified.

Refining Visualisation 2

#Examining the range, mean and SD of the global sale data
range(Data2$Global_Sales)
mean(Data2$Global_Sales)
sd(Data2$Global_Sales)
#Determining the mean + 1 SD to set the upper axis limit on future graphs
mean(Data2$Global_Sales)+sd(Data2$Global_Sales)
## [1] 2.081478

After identifying how my data was distributed, I decided to restrict my axes. I decided that one standard deviation around the mean would be an appropriate way to restrict the global sales data. Because sales cannot be negative, only the upper bound (the mean plus one standard deviation, roughly 2.08 million) was used as the cut-off. Consequently, only games with fewer than 2.1 million sales were included in the final visualisation.
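
The cut-off logic can be sketched on a small made-up vector, which also shows how to check what share of games a cut-off retains. The values and the names cutoff, kept and share_kept are illustrative assumptions, not part of the project code.

```r
# Made-up global sales (in millions) with one extreme value
sales <- c(0.05, 0.1, 0.2, 0.4, 0.8, 1.5, 30)

# Upper cut-off: mean plus one standard deviation
cutoff <- mean(sales) + sd(sales)

# Lower bound for comparison; negative here, so not useful for sales data
lower <- mean(sales) - sd(sales)

# Keep only the games below the cut-off and check the share retained
kept <- sales[sales < cutoff]
share_kept <- length(kept) / length(sales)

print(c(cutoff = cutoff, lower = lower, share_kept = share_kept))
```

Reporting the share retained alongside the plot makes it clear how much of the data a restricted view represents.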

#Joint plot as above, restricting shown data to global sales less than 2.1 million units
plot_center <- ggplot(data = subset(Data2, Global_Sales < 2.1), aes(x = Overall_Score, y = Global_Sales)) +
  #Create scatterplot and specify point colour and size
  geom_point(color = 'darkred', alpha = .3, size = .8, stroke = 0) +
  #Create title and axis labels
  labs(x = "Review Score", y = "Global sales (units/millions)", title = "Game Review Scores and Global Sales") +
  #Specify title text size and location
  theme(plot.title = element_text(size = 12, face = 'bold', hjust = 0.5)) +
  #Change axis text
  theme(axis.text = element_text(color = 'black', size = 6)) +
  #Change background to white
  theme(panel.background = element_rect(fill = 'white'))
#Add marginal histograms to the scatterplot
ggMarginal(plot_center, type = "histogram",
           color = 'black', fill = 'darkred')

Visualisation 2 Summary

This visualisation highlights the spread of the data very well. It also shows that games with higher scores are more likely to sell well than games with very low review scores. However, the histograms highlight that although a majority of games scored around 80 on their review score, most did not sell more than 200,000 copies. This is interesting, as it suggests that a game being well received does not guarantee that it will sell well. In fact, there does seem to be a slight drop-off in sale figures for games which receive very high review scores, but it is difficult to ascertain why this is the case based on the data available.
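
One way to put a number on the pattern seen in the plot would be a rank correlation, which is less sensitive to extreme sales figures than a standard correlation. This was not part of the original analysis; the sketch below uses made-up values purely to show the call.

```r
# Made-up scores and sales; one game has no overall score
toy <- data.frame(
  Overall_Score = c(45, 60, 72, 80, 85, NA, 90),
  Global_Sales  = c(0.1, 0.3, 0.2, 1.2, 0.9, 0.5, 2.4)
)

# Spearman rank correlation, ignoring rows with a missing value
rho <- cor(toy$Overall_Score, toy$Global_Sales,
           method = "spearman", use = "complete.obs")

print(rho)
```

A value near 0 would indicate little monotonic relationship, whilst values near 1 or -1 would indicate a strong one.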

Interactive visualisation

Lastly, I decided to create an interactive plot of this data. Although the above graphs were able to answer my research question, I thought it would be interesting to see whether I could create a graph which could allow for the inspection of individual outliers or specific clusters of points. As the relationship between review scores and global sale figures does not appear to be especially strong, I think this is a good way of looking at areas of the data, especially when looking at surprising individual cases (Wii Sports is, again, a great example for this).

#Create an interactive scatterplot of overall review score vs global sales
#Formulas (~) refer to columns of the dataframe passed to the data argument
intplot <- plot_ly(data = Data2, x = ~Overall_Score, y = ~Global_Sales,
                   #Add hover labels to individual points on the plot
                   text = ~paste("Review Score: ", Overall_Score,
                                 "<br>Global Sales: ", Global_Sales,
                                 "<br>Game: ", Name),
                   hoverinfo = "text",
                   #Specify type of plot
                   type = "scatter", mode = "markers",
                   #Specify point aesthetics
                   marker = list(size = 4,
                                 color = 'rgba(170, 30, 15, .7)',
                                 line = list(color = 'rgba(100, 30, 15, .7)',
                                             width = 1)))
#Add title and axis labels
intplot <- intplot %>%
  layout(title = "Global Sales and Review Scores",
         xaxis = list(title = "Review Score", range = c(0, 100)),
         yaxis = list(title = "Global Sales (millions of units)", range = c(0, 85)),
         font = list(color = '#21130d',
                     size = 14))
#Format graph background
intplot <- intplot %>%
  layout(plot_bgcolor = 'rgb(254, 254, 254)') %>%
  layout(paper_bgcolor = 'rgb(254, 254, 254)') %>%
  layout(xaxis = list(showgrid = FALSE),
         yaxis = list(showgrid = FALSE)) %>%
  layout(margin = list(b = 50, l = 50, r = 50, t = 50))

#Display plot
intplot

The interactive plot is really useful for this data because of how clustered it is in certain areas. It allows for an overall view of the data, as well as for a close inspection of clustered areas. The ability to hover over points and see the global sale number, review score, and name of the game is really beneficial when looking at individual points within a cluster. Originally, I had also included the publisher in the hover labels, as I suspected certain publishers would be behind some of the outliers. I removed this for the final iteration of the interactive plot because it looked quite cluttered, and because this information was not available for every point on the plot and so was inconsistent. Nevertheless, the Plotly package used to create this is an excellent tool for data visualisation.
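
If the publisher were to be reinstated, one possible way around the inconsistency would be to build the hover text beforehand and append the publisher only where it exists. The sketch below uses a made-up dataframe; hover_text and has_pub are hypothetical names, and this is not the code used for the final plot.

```r
# Made-up games; the second has no publisher recorded
toy <- data.frame(
  Name = c("Game A", "Game B"),
  Publisher = c("Publisher X", NA),
  Overall_Score = c(78, 64.5),
  Global_Sales = c(1.2, 0.3),
  stringsAsFactors = FALSE
)

# Treat NA or empty strings as a missing publisher
has_pub <- !is.na(toy$Publisher) & toy$Publisher != ""

# Build the label, adding the publisher line only when it is available
hover_text <- paste0(
  "Review Score: ", toy$Overall_Score,
  "<br>Global Sales: ", toy$Global_Sales,
  "<br>Game: ", toy$Name,
  ifelse(has_pub, paste0("<br>Publisher: ", toy$Publisher), "")
)

print(hover_text)
```

A vector built this way could then be stored as a column and supplied to the text argument of plot_ly in place of the paste call.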

Conclusion

Visualisation 1 produced a very clear graph presenting the broad trends of the entire data set. However, this also meant that outliers (such as Wii Sports) had a substantial influence on the overall picture presented. The reality is that a majority of games do not reach 1 million copies sold, so the steep increase towards the higher review scores on this graph is misleading. The joint scatter-histogram plot made for visualisation 2 gave a good indication of how sales and reviews related to one another, as well as information about the distribution of the data set. Whilst this plot is arguably ‘messier’, I think it provides a better understanding of the data. It shows a subset of the data which represents a majority of games, as opposed to visualisation 1, which was misleading due to the pull of outliers.

The interactive plot is a good addition as well, as it allows for looking at clusters and outliers more easily than a static plot is able to, given the sheer number of data points plotted. Being able to see the global sale figure and review score for an individual game at a glance is also a benefit to these types of interactive plots.

Future Visualisations

If a more up-to-date data set could be obtained, it would be interesting to see whether these trends (or lack thereof) have changed as a result of the Coronavirus pandemic of 2020. Following repeated lockdowns, the gaming hobby has seen a surge in popularity over the past two years, as well as the release of some highly anticipated games, such as Animal Crossing: New Horizons and Elden Ring. This data set is also missing games from more recent consoles such as the Nintendo Switch, which saw a boom in sales following the pandemic. It may also be interesting to examine these trends on a publisher-by-publisher basis to see whether different companies see different relationships between review scores and global sale figures. A brief look at this data set certainly suggests that Nintendo are likely to see particularly high global sale figures for their large IPs (such as Mario, or Wii Sports spin-offs).

Extra Animated plot

One of the things I wanted to learn to do as part of this module was to animate graphs. Ultimately, this was not the most appropriate way to show off the data I had obtained, but I attempted it nevertheless. The gganimate package is a fun tool to animate plots. One of the major issues I had with this package for my data was that because of the size of my data set, my current hardware was unable to process particularly large animations without failing and crashing. The solution for this was to restrict the number of frames and the duration of the animation, but this did result in a very jerky plot animation shown below.

library(gganimate)
library(gifski)

#Smooth line graph of review scores and global sale figures
#Aesthetics are mapped to column names so they are evaluated within the subsetted data
#linewidth replaces size for lines from ggplot2 3.4.0 onwards
g <- ggplot(data = na.omit(subset(Data2, Overall_Score > 20 & Overall_Score < 90)), aes(x = Overall_Score, y = Global_Sales))
g2 <- g + geom_smooth(method = "loess", colour = 'darkred', linewidth = .7, se = FALSE) +
  #Limit xaxis
  xlim(0, 100)+
  #Create title and axis labels
  labs(x = "Review Score", y = "Global sales (units/millions)", title = "Game Review Scores and Global Sales")+
  #Specify title text size and location
  theme(plot.title = element_text(size = 12, face = 'bold', hjust = 0.5))+
  #Change axis text
  theme(axis.text = element_text(color = 'black', size = 6))+
  #Change background to white
  theme(panel.background = element_rect(fill = 'white'))

#Animate the above plot
anim <- g2 +
  transition_reveal(Overall_Score)

#View animation
animate(anim, nframes = 35, duration = 7)
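
The jerkiness follows from the frame rate implied by these settings: 35 frames spread over 7 seconds gives 5 frames per second. The short calculation below shows this, along with how many frames a smoother animation of the same length would need; the 20 frames per second target is an illustrative assumption.

```r
# Settings used in the animate() call
nframes <- 35
duration <- 7

# Implied frame rate: frames divided by seconds
fps <- nframes / duration

# Frames needed for the same duration at a hypothetical smoother frame rate
target_fps <- 20
frames_needed <- target_fps * duration

print(c(fps = fps, frames_needed = frames_needed))
```

The smoother version needs four times as many frames to render, which is the trade-off against the hardware limits described above.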

References

The GitHub repository containing all the code and resources used for this project can be found here.